Foreword

This Technical Specification has been produced by the 3rd Generation Partnership Project (3GPP).

The contents of the present document are subject to continuing work within the TSG and may change following formal 
TSG approval. Should the TSG modify the contents of the present document, it will be re-released by the TSG with an 
identifying change of release date and an increase in version number as follows: 

Version x.y.z 

where: 

x the first digit:

1 presented to TSG for information; 

2 presented to TSG for approval; 

3 or greater indicates TSG approved document under change control. 

y the second digit is incremented for all changes of substance, i.e. technical enhancements, corrections, 
updates, etc. 

z the third digit is incremented when editorial only changes have been incorporated in the document. 
1 Scope



The present document specifies two alternatives for the Voice Activity Detector (VAD) to be used in the Discontinuous 
Transmission (DTX) as described in [3]. Implementors of mobile station and infrastructure equipment conforming to 
the AMR specifications can choose which of the two VAD options to implement. There are no interoperability factors 
associated with this choice. 

The requirements are mandatory on any VAD to be used either in User Equipment (UE) or Base Station Systems
(BSSs) that utilize the AMR speech codec.



2 References



The following documents contain provisions which, through reference in this text, constitute provisions of the present 
document. 

• References are either specific (identified by date of publication, edition number, version number, etc.) or 
non-specific. 

• For a specific reference, subsequent revisions do not apply. 

• For a non-specific reference, the latest version applies. In the case of a reference to a 3GPP document (including 
a GSM document), a non-specific reference implicitly refers to the latest version of that document in the same 
Release as the present document. 

[1] 3GPP TS 26.073: "Adaptive Multi-Rate (AMR); ANSI C source code". 

[2] 3GPP TS 26.090: "Transcoding functions". 

[3] 3GPP TS 26.093: "Source Controlled Rate operation".

[4] ITU, The International Telecommunications Union, Blue Book, Vol. III, Telephone Transmission
Quality, IXth Plenary Assembly, Melbourne, 14-25 November, 1988, Recommendation G.711,
Pulse code modulation (PCM) of voice frequencies.



3 Technical Description of VAD Option 1 

3.1 Definitions, symbols and abbreviations 

3.1.1 Definitions 

For the purposes of the present document, the following terms and definitions apply: 

frame: time interval of 20 ms corresponding to the time segmentation of the speech 
transcoder 

3.1.2 Symbols 

For the purposes of the present document, the following symbols apply. 

3.1.2.1 Variables 

bckr_est[n]          background noise estimate
burst_count          counts length of a speech burst, used by VAD hangover addition
hang_count           hangover counter, used by VAD hangover addition
complex_hang_count   hangover counter, used by CAD hangover addition
complex_hang_timer   hangover initiator, used for Complex Activity Estimation
lagcount             pitch detection counter
level[n]             signal level
new_speech           pointer of the speech encoder, points to a buffer containing the last received samples of a speech frame [2]
noise_level          average level of the background noise estimate
oldlagcount          lagcount of the previous frame
pitch                flag indicating presence of a periodic signal
complex_warning      flag indicating the presence of a complex signal
best_corr_hp         normalized and limited value from maximum HP filtered correlation vector
corr_hp              filtered best_corr_hp values
pow_sum              power of the input frame
s(i)                 samples of the input frame
snr_sum              measure between input frame and noise estimate
stat_count           stationarity counter
stat_rat             measure indicating stationarity
T_op[n]              open-loop lags [2]
t0                   autocorrelation maxima calculated by the open-loop pitch analysis [2]
t1                   signal power related to the autocorrelation maxima t0 [2]
tone                 flag indicating the presence of a tone
vad_thr              VAD threshold
VAD_flag             boolean VAD flag
vadreg               intermediate VAD decision
complex_low          intermediate complex signal decision
complex_high         intermediate complex signal decision



3.1.2.2 Constants

ALPHA_UP1              constant for updating noise estimate (see clause 3.3.5.2)
ALPHA_DOWN1            constant for updating noise estimate (see clause 3.3.5.2)
ALPHA_UP2              constant for updating noise estimate (see clause 3.3.5.2)
ALPHA_DOWN2            constant for updating noise estimate (see clause 3.3.5.2)
ALPHA3                 constant for updating noise estimate (see clause 3.3.5.2)
ALPHA4                 constant for updating average signal level (see clause 3.3.5.2)
ALPHA5                 constant for updating average signal level (see clause 3.3.5.2)
BURST_LEN_HIGH_NOISE   constant for controlling VAD hangover addition (see clause 3.3.5.1)
BURST_LEN_LOW_NOISE    constant for controlling VAD hangover addition (see clause 3.3.5.1)
COEFF3                 coefficient for the filter bank (see clause 3.3.1)
COEFF5_1               coefficient for the filter bank (see clause 3.3.1)
COEFF5_2               coefficient for the filter bank (see clause 3.3.1)
HANG_LEN_HIGH_NOISE    constant for controlling VAD hangover addition (see clause 3.3.5.1)
HANG_LEN_LOW_NOISE     constant for controlling VAD hangover addition (see clause 3.3.5.2)
HANG_NOISE_THR         constant for controlling VAD hangover addition (see clause 3.3.5.2)
L_FRAME                size of a speech frame, 160
L_NEXT                 length for the lookahead of the speech encoder, 40
LTHRESH                threshold for pitch detection (see clause 3.3.2)
NOISE_MAX              maximum value for noise estimate (see clause 3.3.5.2)
NOISE_MIN              minimum value for noise estimate (see clause 3.3.5.2)
NTHRESH                threshold for pitch detection (see clause 3.3.2)
POW_PITCH_THR          threshold for pitch detection (see clause 3.3.5)
POW_COMPLEX_THR        threshold for complex detection (see clause 3.3.5)
STAT_COUNT             threshold for stationary detection (see clause 3.3.5.2)
CAD_MIN_STAT_COUNT     minimum threshold after complex warning
STAT_THR               threshold for stationary detection (see clause 3.3.5.2)
STAT_THR_LEVEL         threshold for stationary detection (see clause 3.3.5.2)
TONE_THR               threshold for tone detection (see clause 3.3.3)
VAD_P1                 constant of computation for VAD threshold (see clause 3.3.5.2)
VAD_POW_LOW            constant for controlling VAD hangover addition (see clause 3.3.5.1)
VAD_SLOPE              constant of computation for VAD threshold (see clause 3.3.5)
VAD_THR_HIGH           constant of computation for VAD threshold (see clause 3.3.5)






3GPP TS 26.094 version 10.0.0 Release 10 



ETSI TS 126 094 V10.0.0 (2011-04)



CVAD_THRESH_ADAPT_HIGH   constant for updating complex_high
CVAD_THRESH_ADAPT_LOW    constant for updating complex_low
CVAD_THRESH_HANG         constant for updating complex_hang_timer
CVAD_HANG_LIMIT          constant for initiating complex_hang_count
CVAD_HANG_LENGTH         constant for resetting complex_hang_count



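3.1.2.3 Functions

+      addition
-      subtraction
*      multiplication
/      division
|x|    absolute value of x
AND    Boolean AND
OR     Boolean OR

$$\sum_{n=a}^{b} x(n) = x(a) + x(a+1) + \ldots + x(b-1) + x(b)$$

$$\mathrm{MIN}(x,y) = \begin{cases} x, & x < y \\ y, & \text{otherwise} \end{cases} \qquad \mathrm{MAX}(x,y) = \begin{cases} x, & x > y \\ y, & \text{otherwise} \end{cases}$$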



3.1.3 Abbreviations 



For the purposes of the present document, the following abbreviations apply: 



ANSI   American National Standards Institute
DTX    Discontinuous Transmission
VAD    Voice Activity Detector
CAD    Complex Activity Detection
CNG    Comfort Noise Generation



3.2 General 



The function of the VAD algorithm is to indicate whether each 20 ms frame contains signals that should be transmitted, 
i.e. speech, music or information tones. The output of the VAD algorithm is a Boolean flag (VAD_flag) indicating 
presence of such signals. 



3.3 Functional description 



The block diagram of the VAD algorithm is depicted in figure 3.1. The VAD algorithm uses parameters of the speech
encoder to compute the Boolean VAD flag (VAD_flag). Samples of the input frame (s(i)) are divided into sub-bands
and the level of the signal in each band (level[n]) is calculated. Inputs for the pitch detection function are the open-loop lags
(T_op[n]), which are calculated by the open-loop pitch analysis of the speech encoder. The pitch detection function
computes a flag (pitch) which indicates the presence of pitch. The tone detection function calculates a flag (tone), which
indicates the presence of an information tone. Tones are detected based on the pitch gain of the open-loop pitch analysis. The
pitch gain is estimated using autocorrelation values (t0 and t1) received from the pitch analysis. The complex signal
detection function calculates a flag (complex_warning), which indicates the presence of a correlated complex signal such
as music. Correlated complex signals are detected based on analysis of the correlation vector available in the open-loop
pitch analysis. The VAD decision function estimates background noise levels. The intermediate VAD decision is calculated
based on the comparison of the background noise estimate and the levels of the input frame (level[n]). Finally, the VAD
flag is calculated by adding hangover to the intermediate VAD decision.

Figure 3.1: Simplified block diagram of the VAD algorithm: Option 1

3.3.1 Filter bank and computation of sub-band levels 

The input signal is divided into frequency bands using a 9-band filter bank (figure 3.2). Cut-off frequencies for the filter 
bank are shown in table 3.1. 

Table 3.1: Cut-off frequencies for the filter bank

Band number   Frequencies
1             0 - 250 Hz
2             250 - 500 Hz
3             500 - 750 Hz
4             750 - 1000 Hz
5             1000 - 1500 Hz
6             1500 - 2000 Hz
7             2000 - 2500 Hz
8             2500 - 3000 Hz
9             3000 - 4000 Hz



Input for the filter bank is the speech frame pointed to by the new_speech pointer of the speech encoder [1]. Input values
for the filter bank are scaled down by one bit. This ensures safe scaling, i.e. saturation cannot occur during calculation
of the filter bank.
[Figure 3.2: a cascade of three 5th order and five 3rd order filter blocks splitting the input into the bands 0 - 250 Hz, 250 - 500 Hz, 500 - 750 Hz, 750 - 1000 Hz, 1 - 1.5 kHz, 1.5 - 2 kHz, 2 - 2.5 kHz, 2.5 - 3 kHz and 3 - 4 kHz]

Figure 3.2: Filter bank

The filter bank consists of 5th and 3rd order filter blocks. Each filter block divides the input into high-pass and low-pass
parts and decimates the sampling frequency by 2. The 5th order filter block is calculated as follows:

(3.1a)
(3.1b)

where

x(i)          input signal for a filter block
$x_{lp}(i)$   low-pass component
$x_{hp}(i)$   high-pass component

The 3rd order filter block is calculated as follows:

x,^(i)^0.5*(x(i) + A,(x(i-m 

x,^(i) = 0.5*(x(i)-A,(x(i-m 



(3.2a) 



(3.2b) 
The filters $A_1()$, $A_2()$, and $A_3()$ are first order direct form all-pass filters, whose transfer function is given by:

$$A(z) = \frac{C + z^{-1}}{1 + C \cdot z^{-1}} \tag{3.3}$$

where C is the filter coefficient.

Coefficients for the all-pass filters $A_1()$, $A_2()$, and $A_3()$ are COEFF5_1, COEFF5_2, and COEFF3, respectively.
Signal level is calculated at the output of the filter bank at each frequency band as follows:

$$level(n) = \sum_{i=START}^{END} |x_n(i)| \tag{3.4}$$
where:

n          index for the frequency band
$x_n(i)$   sample i at the output of the filter bank at frequency band n

$$START = \begin{cases} -2, & n \le 4 \\ -4, & 5 \le n \le 8 \\ -8, & n = 9 \end{cases} \qquad END = \begin{cases} 9, & n \le 4 \\ 19, & 5 \le n \le 8 \\ 39, & n = 9 \end{cases}$$

Negative indices of $x_n(i)$ refer to the previous frame.
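
As an illustration only, the first order all-pass section of equation (3.3) and the level computation of equation (3.4) can be sketched in floating point as follows. The function names and the list-based frame layout are assumptions made for this sketch; the normative implementation is the fixed-point ANSI C code in [1].

```python
def allpass_first_order(x, c, state=(0.0, 0.0)):
    """Apply A(z) = (c + z^-1) / (1 + c*z^-1) to the sequence x.

    Difference equation: y[n] = c*x[n] + x[n-1] - c*y[n-1].
    `state` holds (x[n-1], y[n-1]) carried over from the previous call.
    """
    x_prev, y_prev = state
    out = []
    for sample in x:
        y = c * sample + x_prev - c * y_prev
        out.append(y)
        x_prev, y_prev = sample, y
    return out, (x_prev, y_prev)


def sub_band_level(n, current, previous):
    """Sum of magnitudes for band n (1..9) over i = START..END.

    `current` holds the band output for this frame, `previous` the band
    output of the previous frame (hypothetical layout for this sketch).
    """
    start = -2 if n <= 4 else (-4 if n <= 8 else -8)
    end = 9 if n <= 4 else (19 if n <= 8 else 39)
    tail = previous[start:]  # negative indices refer to the previous frame
    return sum(abs(v) for v in tail) + sum(abs(v) for v in current[:end + 1])
```

The sketch omits the one-bit input scaling and the decimation performed by the filter blocks.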

3.3.2 Pitch detection 

The purpose of the pitch detection function is to detect vowel sounds and other periodic signals. The pitch detection is
based on comparison of open-loop lags (T_op[n]), which are calculated by the speech encoder [2]. If the difference of
consecutive open-loop lags (T_op[n]) is smaller than a threshold, lagcount is incremented. If the sum of the lagcounts of
two consecutive frames is high enough, the pitch flag is set. For the 5.15 and 4.75 kbit/s rates, only one open-loop lag is
calculated, and therefore only the first lag comparison is made every frame. The pitch flag is calculated as follows:

lagcount = 0
if (|T_op[-1] - T_op[0]| < LTHRESH)
    lagcount = lagcount + 1
if (|T_op[0] - T_op[1]| < LTHRESH)
    lagcount = lagcount + 1
if (lagcount + oldlagcount >= NTHRESH)
    pitch = 1
else
    pitch = 0
oldlagcount = lagcount

T_op[-1] refers to the open-loop lag of the previous frame.
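
The lag comparison above can be sketched as a small function. This is an illustrative sketch, not the reference implementation: the function name and argument layout are assumptions, and the threshold values are passed in by the caller because they are not stated in this clause.

```python
def pitch_flag(t_op_prev, t_op, oldlagcount, lthresh, nthresh):
    """Return (pitch, lagcount) for one frame.

    t_op_prev: last open-loop lag of the previous frame, T_op[-1]
    t_op:      list with the one or two open-loop lags of this frame
    """
    lagcount = 0
    previous = t_op_prev
    for lag in t_op:
        if abs(previous - lag) < lthresh:
            lagcount += 1
        previous = lag
    pitch = 1 if lagcount + oldlagcount >= nthresh else 0
    return pitch, lagcount  # lagcount becomes oldlagcount for the next frame
```

Passing a single-element list for `t_op` reproduces the behaviour described for the 5.15 and 4.75 kbit/s rates, where only the first comparison is made.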

3.3.3 Tone detection 

Tone detection is used to detect information tones, since the pitch detection function cannot always detect these signals.
Other signals which contain a very strong periodic component are also detected, because it may sound annoying if these
signals are replaced by comfort noise. If the open-loop pitch gain is higher than the constant TONE_THR, a tone is
detected and the tone flag is set. The pitch gain can be tested by comparing variables t0 and t1 as follows:

if (t0 > TONE_THR * t1)
    tone = 1

The speech encoder calculates the pitch in three delay ranges, except for mode 10.2 kbit/s, where only one range is
used. The above comparison is made once for each delay range and the tone flag should be set if the condition is true in at
least one range. Otherwise, the tone flag should be set to zero.
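
A minimal sketch of the per-range test, assuming the caller supplies one (t0, t1) pair per delay range; the function name and the threshold value used in any call are hypothetical.

```python
def tone_flag(range_maxima, tone_thr):
    """Return 1 if t0 > tone_thr * t1 in at least one delay range, else 0.

    range_maxima: iterable of (t0, t1) pairs, one per delay range
    """
    return 1 if any(t0 > tone_thr * t1 for t0, t1 in range_maxima) else 0
```

Comparing t0 against TONE_THR * t1 tests the pitch gain against the threshold without performing a division.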
The variables t0 and t1 are calculated by the open-loop pitch analysis of the speech encoder [2]. The variable t0 is
the autocorrelation maxima given by:

(3.5)

The variable t1 is the signal power related to the autocorrelation maxima t0 at the delay value k:

(3.6)

The open-loop pitch search and correspondingly the tone flag is computed twice in each frame, except for modes 5.15 
kbit/s and 4.75 kbit/s, where it is computed only once. 

3.3.4 Correlated Complex Signal Analysis (and detection) 

Correlated complex signal detection is used to detect correlated signals in the high-pass filtered weighted speech
domain, since the pitch and tone detection functions cannot always detect these signals. Signals which contain very
strong correlation values in the high-pass filtered domain are taken care of, because it may sound really annoying if
these signals are replaced by comfort noise. If the statistics of the maximum normalized correlation value of a high-pass
filtered input signal indicate the presence of a correlated complex signal, the flag complex_warning is set. To reduce
complexity, the high band correlation analysis is performed in a simplified manner by analysing the high-pass filtered
fullband correlation vector, which is available from the OL-LTP analysis performed by the speech encoder at least once
in each frame.

$best\_corr\_hp_m$ is the maximum normalized value of the high-pass filtered correlation in the range 19-146, limited to be
in the range [0.0, 1.0]. (Note that the best_corr_hp value is delayed one frame.) The high-pass filter is a simple first
order filter with coefficients [1, -1]. The best_corr_hp value is filtered according to:

$$corr\_hp_{m+1} = alpha \cdot corr\_hp_m + (1 - alpha) \cdot best\_corr\_hp_m$$

where alpha is varied between 0.98 and 0.8 as a function of $corr\_hp_m$ and $best\_corr\_hp_m$.

The corr_hp output value is thresholded into two registers, complex_high and complex_low, and one counter,
complex_hang_timer.

complex_low is set to 1 if the corr_hp value is greater than CVAD_THRESH_ADAPT_LOW.

complex_high is set to 1 if the corr_hp value is greater than CVAD_THRESH_ADAPT_HIGH.

complex_hang_timer is increased by 1 if the corr_hp value is greater than CVAD_THRESH_HANG. If the corr_hp
value is lower than or equal to CVAD_THRESH_HANG, the complex_hang_timer value is set to 0.

The flag complex_warning is set if complex_low has been set for 15 consecutive frames or complex_high has been set
for 8 consecutive frames.

The open-loop pitch search and correspondingly the tone flag is computed twice in each frame, except for modes 5.15 
kbit/s and 4.75 kbit/s, where it is computed only once. The computation of the corr_hp value is however always done 
only once per frame using the newest correlation vector available. 
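
The once-per-frame update described in this clause can be sketched as follows. This is a simplified illustration under stated assumptions: the rule for choosing alpha is not given in this clause, so alpha is supplied by the caller, and the class name, function name and threshold arguments are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ComplexState:
    corr_hp: float = 0.0
    low_run: int = 0             # consecutive frames with complex_low set
    high_run: int = 0            # consecutive frames with complex_high set
    complex_hang_timer: int = 0


def update_complex(state, best_corr_hp, alpha, thr_low, thr_high, thr_hang):
    """Filter best_corr_hp, update the counters, return complex_warning."""
    best = min(1.0, max(0.0, best_corr_hp))  # limit to [0.0, 1.0]
    state.corr_hp = alpha * state.corr_hp + (1.0 - alpha) * best
    state.low_run = state.low_run + 1 if state.corr_hp > thr_low else 0
    state.high_run = state.high_run + 1 if state.corr_hp > thr_high else 0
    if state.corr_hp > thr_hang:
        state.complex_hang_timer += 1
    else:
        state.complex_hang_timer = 0
    return state.low_run >= 15 or state.high_run >= 8
```

Tracking run lengths instead of bit registers is a design choice of this sketch; it expresses the "15 consecutive frames" and "8 consecutive frames" conditions directly.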

3.3.5 VAD decision 

Power of the input frame is calculated as follows:

$$pow\_sum = \sum_{i=-L\_NEXT}^{L\_FRAME-L\_NEXT-1} s(i) \cdot s(i) \tag{3.7}$$

where samples s(i) of the input frame are pointed to by the new_speech pointer of the speech encoder. If the power of the
input frame (pow_sum) is lower than the constant POW_PITCH_THR, the last pitch flag is set to zero. If the power of the
input frame (pow_sum) is lower than the constant POW_COMPLEX_THR, the last complex_low flag is set to zero.

The difference between the signal levels of the input frame and the background noise estimate is calculated as follows:

$$snr\_sum = \sum_{n=1}^{9} \mathrm{MAX}\left(1.0,\ \frac{level[n]}{bckr\_est[n]}\right)^{2} \tag{3.8}$$

where: 

level[n] signal level at band n 

bckr_est[n] level of background noise estimate at band n 

VAD decision is made by comparing the variable snr_sum to a threshold. The threshold (vad_thr) is tuned to get the desired
sensitivity at each background noise level. The higher the noise level, the lower the threshold. In particular, a low
threshold at high-level background noise is needed to detect speech reliably enough, although the probability of detecting
noise as speech also increases.

Average level of background noise is calculated by adding noise estimates at each band:

$$noise\_level = \sum_{n=1}^{9} bckr\_est[n] \tag{3.9}$$

Threshold is calculated using average noise level as follows:

$$vad\_thr = VAD\_SLOPE \cdot (noise\_level - VAD\_P1) + VAD\_THR\_HIGH \tag{3.10}$$

where VAD_SLOPE, VAD_P1, and VAD_THR_HIGH are constants.

The variable vadreg indicates the intermediate VAD decision and it is calculated as follows:

if (snr_sum > vad_thr)
    vadreg = 1
else
    vadreg = 0
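
The intermediate decision can be sketched in floating point as below. The function name is an assumption, and the constants are passed as arguments because their values are not given in this clause.

```python
def intermediate_vad(level, bckr_est, vad_slope, vad_p1, vad_thr_high):
    """Return (vadreg, snr_sum, vad_thr) for one frame.

    level, bckr_est: per-band signal levels and noise estimates (9 bands)
    """
    snr_sum = sum(max(1.0, lev / est) ** 2 for lev, est in zip(level, bckr_est))
    noise_level = sum(bckr_est)
    vad_thr = vad_slope * (noise_level - vad_p1) + vad_thr_high
    vadreg = 1 if snr_sum > vad_thr else 0
    return vadreg, snr_sum, vad_thr
```

Because each band ratio is limited to at least 1.0, snr_sum never falls below the number of bands, so bands quieter than the noise estimate do not pull the sum down.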

3.3.5.1 Hangover addition 

Before the final VAD flag is given, a hangover is added. The hangover addition helps to detect low power endings of
speech bursts, which are subjectively important but difficult to detect. A long hangover is also added if the signal has
been found to be of a very complex nature for a long time (2 seconds), since the VAD is not likely to work reliably for
such a complex signal.

VAD flag is set to "1" if fewer than hang_len frames with "0" decision have elapsed since burst_len consecutive "1"
decisions were detected. The variables hang_len and burst_len are set depending on the average noise level
(noise_level). The VAD_flag is also controlled by complex_hang_count, which indicates that the signal is too
complex for the VAD and should not be used with a comfort noise generation algorithm. The filtered correlation value
corr_hp is also used as an activity indication after the VAD has indicated noise for a while (200 ms); this
aids in situations where the VAD noise estimate has adapted to a rather stationary signal that is still too complex to
sound good with CNG.

The power of the input frame is compared to a threshold (VAD_POW_LOW). If the power is lower, the VAD flag is set
to "0" and no hangover is added. The VAD_flag is calculated as follows:

if (noise_level > HANG_NOISE_THR)
    burst_len = BURST_LEN_HIGH_NOISE
    hang_len = HANG_LEN_HIGH_NOISE
else
    burst_len = BURST_LEN_LOW_NOISE
    hang_len = HANG_LEN_LOW_NOISE

if (complex_hang_timer > CVAD_HANG_LIMIT) {
    if (complex_hang_count < CVAD_HANG_LENGTH) {
        complex_hang_count = CVAD_HANG_LENGTH
    }
}

if (pow_sum < VAD_POW_LOW) {
    burst_count = 0
    hang_count = 0
    complex_hang_count = 0
    complex_hang_timer = 0
    VAD_flag = 0
    goto Exit
}

VAD_flag = 0

if (complex_hang_count != 0) {
    burst_count = BURST_LEN_HIGH_NOISE
    complex_hang_count = complex_hang_count - 1
    VAD_flag = 1
    goto Exit
} else {
    if ((the 10 last out of 11 vadreg values all are zero) AND
        (corr_hp > CVAD_THRESH_IN_NOISE)) {
        VAD_flag = 1
        goto Exit
    }
}

if (vadreg == 1) {
    burst_count = burst_count + 1
    if (burst_count >= burst_len) {
        hang_count = hang_len
    }
    VAD_flag = 1
} else {
    burst_count = 0
    if (hang_count > 0) {
        hang_count = hang_count - 1
        VAD_flag = 1
    }
}
Label Exit
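
The control flow of the pseudocode above can be sketched as a function with early returns in place of the Exit label. This is an illustrative sketch only: the state class, the dictionary of constants and the `quiet_with_corr` argument (which stands for the combined condition on the last vadreg values and corr_hp) are assumptions of this example.

```python
from dataclasses import dataclass


@dataclass
class HangoverState:
    burst_count: int = 0
    hang_count: int = 0
    complex_hang_count: int = 0
    complex_hang_timer: int = 0


def hangover_addition(st, vadreg, pow_sum, noise_level, quiet_with_corr, c):
    """Return VAD_flag for one frame; `c` maps constant names to values."""
    if noise_level > c["HANG_NOISE_THR"]:
        burst_len, hang_len = c["BURST_LEN_HIGH_NOISE"], c["HANG_LEN_HIGH_NOISE"]
    else:
        burst_len, hang_len = c["BURST_LEN_LOW_NOISE"], c["HANG_LEN_LOW_NOISE"]

    if st.complex_hang_timer > c["CVAD_HANG_LIMIT"]:
        st.complex_hang_count = max(st.complex_hang_count, c["CVAD_HANG_LENGTH"])

    if pow_sum < c["VAD_POW_LOW"]:
        # Very low power: reset all counters, no hangover is added.
        st.burst_count = st.hang_count = 0
        st.complex_hang_count = st.complex_hang_timer = 0
        return 0

    if st.complex_hang_count != 0:
        st.burst_count = c["BURST_LEN_HIGH_NOISE"]
        st.complex_hang_count -= 1
        return 1
    if quiet_with_corr:
        return 1

    if vadreg == 1:
        st.burst_count += 1
        if st.burst_count >= burst_len:
            st.hang_count = hang_len
        return 1
    st.burst_count = 0
    if st.hang_count > 0:
        st.hang_count -= 1
        return 1
    return 0
```

With a burst length of 3 and a hangover length of 2, three active frames followed by silence keep the flag at 1 for two further frames before it drops to 0.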

3.3.5.2 Background noise estimation 

Background noise estimate (bckr_est[n]) is updated using amplitude levels of the previous frame. Thus, the update is
delayed by one frame so that undetected starts of speech bursts do not corrupt the noise estimate. If the internal VAD
decision is "1" or if pitch has been detected, the noise estimate is not updated upwards. The update speed for the current
frame is selected as follows:

if ((vadreg for the last 4 frames has been zero) AND
    (pitch for the last 4 frames has been zero) AND
    (we are not in complex signal hangover))
    alpha_up = ALPHA_UP1
    alpha_down = ALPHA_DOWN1
else
    if ((stat_count == 0) AND (not in complex signal hangover))
        alpha_up = ALPHA_UP2
        alpha_down = ALPHA_DOWN2
    else
        alpha_up = 0
        alpha_down = ALPHA3

The variable stat_count indicates stationarity and its purpose is explained later in this clause. The variables alpha_up and
alpha_down define the update speed upwards and downwards. The update speed for each band n is selected as
follows:

if (bckr_est_m[n] < level_{m-1}[n])
    alpha = alpha_up
else
    alpha = alpha_down

Finally, the noise estimate is updated as follows:

$$bckr\_est_{m+1}[n] = (1.0 - alpha) \cdot bckr\_est_m[n] + alpha \cdot level_{m-1}[n] \tag{3.11}$$
where:

n   index of the frequency band
m   index of the frame

Level of the background estimate (bckr_est[n]) is limited between constants NOISE_MIN and NOISE_MAX.
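
A floating-point sketch of the per-band update, including the limiting between NOISE_MIN and NOISE_MAX. The function name is an assumption; alpha_up and alpha_down are expected to have been selected for the frame as described above, and the numeric values in any call are hypothetical.

```python
def update_noise_estimate(bckr_est, level_prev, alpha_up, alpha_down,
                          noise_min, noise_max):
    """Return the noise estimate for the next frame.

    bckr_est:   current per-band noise estimate
    level_prev: per-band signal levels of the previous frame
    """
    updated = []
    for est, lev in zip(bckr_est, level_prev):
        # Upward and downward adaptation use different speeds.
        alpha = alpha_up if est < lev else alpha_down
        new_est = (1.0 - alpha) * est + alpha * lev
        updated.append(min(noise_max, max(noise_min, new_est)))
    return updated
```

Setting `alpha_up` to zero freezes upward adaptation while still letting the estimate follow falling levels, which matches the behaviour described for frames where speech or pitch has been detected.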

If the level of background noise increases suddenly, vadreg will be set to "1" and the background noise is not updated upwards.
To recover from this situation, update of the background noise estimate is enabled if the intermediate VAD decision
(vadreg) is "1" for a long enough time and the spectrum is stationary. Stationarity (stat_rat) is estimated using the following
equation:

$$stat\_rat = \sum_{n=1}^{9} \frac{\mathrm{MAX}(STAT\_THR\_LEVEL,\ \mathrm{MAX}(ave\_level_m[n],\ level_m[n]))}{\mathrm{MAX}(STAT\_THR\_LEVEL,\ \mathrm{MIN}(ave\_level_m[n],\ level_m[n]))} \tag{3.12}$$

If the stationarity estimate (stat_rat) is higher than a threshold, the stationarity counter (stat_count) is set to the initial value
defined by constant STAT_COUNT. The stationarity counter (stat_count) is also initialised if pitch or tone or a
complex_warning is detected. If the signal is not stationary but speech has been detected (VAD decision is "1"),
stat_count is decreased by one in each frame until it is zero.

if (complex_warning) {
    if (stat_count < CAD_MIN_STAT_COUNT)
        stat_count = CAD_MIN_STAT_COUNT
}

if ((8 last vadreg flags have been zero) OR (2 last pitch flags have been one) OR (5 last tone flags have been one))
    stat_count = STAT_COUNT
else
    if (stat_rat > STAT_THR)
        stat_count = STAT_COUNT
    else
        if ((vadreg) AND (stat_count != 0))
            stat_count = stat_count - 1

The average signal levels (ave_level[n]) are calculated as follows:

$$ave\_level_{m+1}[n] = (1.0 - alpha) \cdot ave\_level_m[n] + alpha \cdot level_m[n] \tag{3.13}$$

The update speed (alpha) for the previous equation is selected as follows:

if (stat_count == STAT_COUNT)
    alpha = 1.0
else if (vadreg == 1)
    alpha = ALPHA5
else
    alpha = ALPHA4
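
The stationarity estimate and the average-level update of this clause can be sketched as two small functions. The names and the floating-point arithmetic are assumptions of this sketch; the floor value used in any call is hypothetical.

```python
def stationarity_ratio(ave_level, level, stat_thr_level):
    """Sum over bands of larger/smaller level, each floored at stat_thr_level."""
    total = 0.0
    for ave, lev in zip(ave_level, level):
        total += (max(stat_thr_level, max(ave, lev))
                  / max(stat_thr_level, min(ave, lev)))
    return total


def update_ave_level(ave_level, level, alpha):
    """First-order smoothing of the per-band average signal level."""
    return [(1.0 - alpha) * ave + alpha * lev
            for ave, lev in zip(ave_level, level)]
```

For a perfectly stationary spectrum every band contributes exactly 1.0, so the ratio equals the number of bands; larger values indicate that the current levels deviate from their running averages.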




4 Technical Description of VAD Option 2 

4.1 Definitions, symbols and abbreviations 

4.1.1 Definitions 

For the purposes of the present document, the following terms and definitions apply: 

codec: combination of an encoder and decoder in series (encoder/decoder) 

compand: process of compressing and expanding a signal. In this text, the process is described in terms of PCM [4] 

Decoder: generally, a device for the translation of a signal from a digital representation into an analog format. For the 
present document, a device which converts speech encoded in the format specified in the present document to analog or 
an equivalent PCM representation 

DFT: see Discrete Fourier Transform 

Discrete Fourier Transform (DFT): method of transforming a time domain sequence into a corresponding frequency 
domain sequence 

Encoder: generally, a device for the translation of a signal into a digital representation. For the present document, a 
device which converts speech from an analog or its equivalent PCM representation to the digital representation 
described in the present document 

Fast Fourier Transform (FFT): efficient implementation of the Discrete Fourier Transform 

FFT: see Fast Fourier Transform 

Vocoder: voice coder

frame: time interval of 20 ms corresponding to the time segmentation of the speech transcoder 

4.1.2 Symbols 

For the purposes of the present document, the following symbols apply. 

4.1.2.1 Variables 

$\alpha_{ch}(m)$            channel energy smoothing factor
$\alpha(m)$                 exponential windowing factor
$\Delta_E(m)$               estimated spectral deviation between current power spectrum and average long-term power spectral estimate
$\phi(m)$                   spectral peak-to-average ratio
$\sigma_q(i)$               quantized channel SNR indices
b(m)                        burst count
$b_{th}$                    burst count threshold
{d(m)}                      overlapped portion of the frame buffer of input samples
$E_{ch}(m,i)$               channel energy estimate; channel i, subframe m
$\mathbf{E}_{ch}(m)$        vector of channel energy estimates, $0 \le i < N_c$
$E_{dB}(m,i)$               estimated log power spectrum
$\mathbf{E}_{dB}(m)$        vector of log power spectrum estimates, $0 \le i < N_c$
$\bar{E}_{dB}(m,i)$         average long-term power spectral estimate
$\bar{\mathbf{E}}_{dB}(m)$  vector of average long-term power spectral estimates, $0 \le i < N_c$
$E_n(m,i)$                  channel noise estimate
$\mathbf{E}_n(m)$           vector of channel noise estimates, $0 \le i < N_c$
$E_{tn}(m)$                 total estimated noise energy
$E_{tot}(m)$                total channel energy
E toiifn) modified total channel energy

h(m) hysteresis counter 

hc-n, hangover count 

h„(n) overlap-and-add buffer of samples 

hyster_cnt hysteresis counter to avoid long term creeping of update_cnt 

last_update_cnt previous value of update_cnt 

$s_{hp}(n)$      sample at the output of the speech encoder high pass filter
sinewave_flag    boolean flag, set TRUE when the spectral peak-to-average ratio is greater than 10 dB and the spectral deviation is less than DEV_THLD
SNR              signal-to-noise ratio
$SNR_p(m)$       long-term peak SNR
$SNR_q(m)$       quantized version of $SNR_p(m)$
update_cnt       counter gating noise estimate update process
update_flag      flag controlling noise estimate updating
VAD(m)           boolean VAD flag for subframe m
VAD_flag         boolean VAD flag
v(m)             sum of voice metrics
$v_{th}$         voice metric threshold



4.1.2.2 



Constants 



an 
a„ 

btable 

D 
DEV_THLD 

^floor 

E„ 
F 

El 

P 

fH 
fi 

g(n) 
G(k) 

htable 

HYSTER_CNT_THLD 

L 

M 

Nc 

NOISE_FLOOR_D 

UPDATE_CNT_THLD 

UPDATE_THLD 

V 

Vtable 



upper limit for values of ce(m) 
lower limit for values of C((m) 
channel noise smoothing factor 
pre-emphasis factor 
table to generate bth 
overlap (delay) in sample intervals 
threshold for setting sinewavejlag 
low threshold for E,„,{m) 

high energy endpoint for linear interpolation of E,„,(m) 

minimum allowable channel noise initialisation energy 

low energy endpoint for linear interpolation of E,„,(m) 

minimum allowable channel energy 

high channel combining table 

low channel combining table 

trapezoidal window, n = to M 

frequency domain transformation of g(n) 

table to generate hct 

threshold for hyster_cnt 

subframe length in samples 

DFT sequence length 

number of combined channels 

low threshold for Em{m) in dB 

threshold for update_cnt 

threshold for v(m) 

voice metric table 

table to generate Vth 



4.1.2.3 Functions

+                     addition
-                     subtraction
*                     multiplication
/                     division
$\lfloor x \rfloor$   largest integer $\le x$
AND                   Boolean AND
OR                    Boolean OR

$$\sum_{n=a}^{b} x(n) = x(a) + x(a+1) + \ldots + x(b-1) + x(b)$$
4.1.3 Abbreviations

For the purposes of the present document, the following abbreviations apply:

ANSI   American National Standards Institute
DTX    Discontinuous Transmission
VAD    Voice Activity Detector
CAD    Complex Activity Detection
CNG    Comfort Noise Generation



4.2 General 

The function of the VAD algorithm is to indicate whether each 20 ms frame contains signals that should be transmitted, 
i.e. speech, music or information tones. The output of the VAD algorithm is a Boolean flag (VAD_flag) indicating 
presence of such signals. 



4.3 Functional description 



The block diagram of the VAD algorithm is depicted in figure 4.1. The VAD algorithm uses parameters of the speech
encoder to compute the Boolean VAD flag (VAD_flag).



[Figure 4.1 blocks: Frequency Domain Conversion, Channel Energy Estimator, Channel SNR Estimator, Voice Metric Calculation, Spectral Deviation Estimator, Peak-to-Average Ratio, Background Noise Estimator, Noise Update Decision, VAD. Labelled signals: s_hp(n), G(k), E_ch(m), E_n(m), E_tn(m), E_tot(m), v(m), update_flag, fupdate_flag, VAD_flag]

Figure 4.1: Block diagram of the VAD algorithm: Option 2

Input: 

The output of the High-Pass Filter, {$s_{hp}(n)$}

LTP_flag is generated by the comparison of the long-term prediction gain to a constant threshold LTP_THLD,
where the long-term prediction gain y^is derived from the speech encoder[2] open-loop pitch predictor. 



Output: 



The output of the VAD is designated as VAD_flag.

Initialization:

The following variables shall be set to zero at initialization (frame m = 0):

The pre-emphasis memory

The following shall be initialized to a startup value other than zero:

The channel energy estimate, $E_{ch}(m)$ (see clause 4.3.2)
The long-term power spectral estimate, $\bar{E}_{dB}(m)$ (see clause 4.3.5)
The channel noise estimate, $E_n(m)$ (see clause 4.3.8)

Processing: The following procedures shall be executed two times per 20 ms speech frame and the current 10 ms
subframe shall be denoted m.

4.3.1 Frequency Domain Conversion 

The input signal is pre-emphasised and windowed prior to frequency domain conversion. This process is defined as:

$$d(n) = s_{hp}(n) + \zeta_p \, s_{hp}(n-1), \quad 0 \le n < L \tag{4.1}$$

where d(n) is the pre-emphasised speech buffer, $\zeta_p$ is the pre-emphasis factor, and L is the subframe length. A
rectangular window is then used to frame the speech prior to frequency domain conversion, which is expressed as:

$$g(n) = \begin{cases} 0, & 0 \le n < D,\ L + D \le n < M \\ d(n-D), & D \le n < L + D \end{cases} \tag{4.2}$$

where D is the zero-padding offset into the DFT buffer, and M is the DFT length. The transformation of g(n) to the
frequency domain is performed using the Discrete Fourier Transform (DFT) defined¹ as:

$$G(k) = \frac{2}{M} \sum_{n=0}^{M-1} g(n) \, e^{-j 2 \pi n k / M}, \quad 0 \le k < M \tag{4.3}$$

where $e^{j\omega}$ is a unit amplitude complex phasor with instantaneous radial position $\omega$.
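
The three steps of this clause (pre-emphasis, zero-padded framing, scaled DFT) can be sketched directly from equations (4.1) to (4.3). The sketch evaluates the DFT naively rather than with an FFT; the function name, the `s_prev` argument carrying the pre-emphasis memory, and all numeric values in any call are assumptions of this example.

```python
import cmath


def frequency_domain_conversion(s_hp, s_prev, zeta_p, D, M):
    """Return G(k), 0 <= k < M, for one subframe of high-pass filtered samples.

    s_hp:   the L samples of the current subframe
    s_prev: last sample of the previous subframe (pre-emphasis memory)
    """
    L = len(s_hp)
    if D + L > M:
        raise ValueError("subframe plus offset must fit in the DFT buffer")
    # Pre-emphasis: d(n) = s_hp(n) + zeta_p * s_hp(n - 1)
    d = [s_hp[n] + zeta_p * (s_hp[n - 1] if n > 0 else s_prev) for n in range(L)]
    # Framing: zeros outside D <= n < L + D
    g = [0.0] * M
    g[D:D + L] = d
    # DFT with the 2/M scale factor
    return [(2.0 / M) * sum(g[n] * cmath.exp(-2j * cmath.pi * n * k / M)
                            for n in range(M))
            for k in range(M)]
```

A production implementation would use an FFT; the direct sum is used here only because it mirrors the definition term by term.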

4.3.2 Cinannel Energy Estimator 

Calculate the channel energy estimate $E_{ch}(m)$ for the current subframe, m, as:

$$E_{ch}(m,i) = \max\left\{ E_{min},\; \alpha_{ch}(m)\, E_{ch}(m-1,i) + \bigl(1 - \alpha_{ch}(m)\bigr) \frac{1}{f_H(i) - f_L(i) + 1} \sum_{k=f_L(i)}^{f_H(i)} |G(k)|^2 \right\}, \quad 0 \le i < N_c, \qquad (4.4)$$

where $E_{min}$ is the minimum allowable channel energy, $\alpha_{ch}(m)$ is the channel energy smoothing factor (defined below),
$N_c$ is the number of combined channels, and $f_L(i)$ and $f_H(i)$ are the i-th elements of the respective low and high channel
combining tables.

The channel energy smoothing factor, $\alpha_{ch}(m)$, is defined as:

$$\alpha_{ch}(m) = \begin{cases} 0, & m \le 1 \\ 0.45, & m > 1 \end{cases} \qquad (4.5)$$


¹ This atypical definition is used to exploit the efficiencies of the complex Fast Fourier Transform (FFT). The 2/M scale factor results
from preconditioning the M point real sequence to form an M/2 point complex sequence that is transformed using an M/2 point
complex FFT. Details on this technique can be found in Proakis, J. G. and Manolakis, D. G., Introduction to Digital Signal
Processing, New York, Macmillan, 1988, pp. 721-722.
This means that $\alpha_{ch}(m)$ assumes a value of zero for the first frame (m = 1) and a value of 0.45 for all subsequent
frames. This allows the channel energy estimate to be initialized to the unfiltered channel energy of the first frame.
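The channel energy recursion of equations (4.4) and (4.5) can be sketched as follows. This is a minimal illustration only: the function name, the argument layout, and the use of a precomputed `mag2[k]` array holding $|G(k)|^2$ are assumptions, and the combining tables and $E_{min}$ are supplied by the caller rather than taken from the specification.

```c
#include <assert.h>
#include <stddef.h>

/* One subframe update of the channel energies, equation (4.4).
 * e_ch[i] holds E_ch(m-1,i) on entry and E_ch(m,i) on exit.
 * mag2[k] holds |G(k)|^2; f_lo and f_hi are the channel combining tables. */
void update_channel_energy(double *e_ch, const double *mag2,
                           const int *f_lo, const int *f_hi, size_t nc,
                           long m, double e_min)
{
    /* Equation (4.5): no smoothing on the first subframe. */
    double alpha = (m <= 1) ? 0.0 : 0.45;

    for (size_t i = 0; i < nc; i++) {
        assert(f_lo[i] >= 0 && f_hi[i] >= f_lo[i]);
        double sum = 0.0;
        for (int k = f_lo[i]; k <= f_hi[i]; k++)
            sum += mag2[k];
        /* Mean bin energy over the channel. */
        double mean = sum / (double)(f_hi[i] - f_lo[i] + 1);
        double e = alpha * e_ch[i] + (1.0 - alpha) * mean;
        e_ch[i] = (e > e_min) ? e : e_min;
    }
}
```

With m = 1 the previous contents of `e_ch` are ignored, which matches the statement that the estimate is initialized to the unfiltered channel energy of the first frame.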

4.3.3 Channel SNR Estimator 



Estimate the channel SNR vector $\{\sigma\}$ as:

$$\sigma(i) = 10 \log_{10}\left( \frac{E_{ch}(m,i)}{E_n(m,i)} \right), \quad 0 \le i < N_c, \qquad (4.6)$$

where $E_n(m)$ is the current channel noise energy estimate (see clause 4.3.8), and then quantize the channel SNR estimate
in 3/8 dB steps to yield the channel SNR indices $\{\sigma_q\}$ given as:

$$\sigma_q(i) = \max\{0,\; \min\{89,\; \mathrm{round}\{\sigma(i) / 0.375\}\}\}, \quad 0 \le i < N_c, \qquad (4.7)$$

where the values of $\{\sigma_q\}$ are constrained to be between 0 and 89, inclusive.

4.3.4 Voice Metric Calculation 

Next, calculate the sum of voice metrics as:

$$v(m) = \sum_{i=0}^{N_c - 1} V(\sigma_q(i)), \qquad (4.8)$$

where $V(k)$ is the k-th value of the 90 element voice metric table V.
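Equation (4.8) is a table lookup followed by a sum, which can be sketched as follows. The function name and the integer table type are assumptions; the actual contents of the voice metric table V are not reproduced here and must be supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>

#define VM_TABLE_LEN 90   /* number of entries in the voice metric table */

/* Equation (4.8): sum the voice metric of every channel SNR index. */
int sum_voice_metrics(const int *sigma_q, size_t nc,
                      const int table[VM_TABLE_LEN])
{
    int v = 0;
    for (size_t i = 0; i < nc; i++) {
        /* Indices are guaranteed to be 0..89 by equation (4.7). */
        assert(sigma_q[i] >= 0 && sigma_q[i] < VM_TABLE_LEN);
        v += table[sigma_q[i]];
    }
    return v;
}
```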



4.3.5 Frame SNR and Long-Term Peak SNR Calculation 

The instantaneous frame SNR, $SNR$, and long-term peak SNR, $SNR_p(m)$, are used to calibrate the responsiveness of the
VAD decision. When the frame count is less than or equal to four ($m \le 4$) or the forced update flag (see clause 4.3.10) is set
(fupdate_flag == TRUE), then the SNRs are initialized as:

$$SNR_p(m) = SNR = 56 - 10 \log_{10}\left( \sum_{i=0}^{N_c - 1} E_n(m,i) \right) \qquad (4.9)$$

Otherwise, the instantaneous frame SNR is generated by:

$$SNR = 10 \log_{10}\left( \frac{1}{N_c} \sum_{i=0}^{N_c - 1} 10^{\sigma(i)/10} \right) \qquad (4.10)$$

and the long-term peak SNR is derived by the following expression:

$$SNR_p(m) = \begin{cases} 0.9\, SNR_p(m-1) + 0.1\, SNR, & SNR > SNR_p(m-1) \\ 0.998\, SNR_p(m-1) + 0.002\, SNR, & 0.625\, SNR_p(m-1) < SNR \le SNR_p(m-1) \\ SNR_p(m-1), & \text{otherwise} \end{cases} \qquad (4.11)$$

The long-term peak SNR is then quantized in 3 dB steps and limited to be between 0 and 19, as follows:

$$SNR_q = \max\{\min\{\lfloor SNR_p(m) / 3 \rfloor,\; 19\},\; 0\} \qquad (4.12)$$

where $\lfloor x \rfloor$ is the largest integer $\le x$ (floor function).
4.3.6 Negative SNR Sensitivity Bias

In order for the VAD decision to overcome the problem of being over-sensitive to fluctuating, non-stationary
background noise conditions, a bias factor is used to increase the threshold on which the VAD decision is based. This
bias factor is derived from an estimate of the variability of the background noise estimate. The variability estimate is
further based on negative values of the instantaneous SNR. It is presumed that a negative SNR can only occur as a
result of fluctuating background noise, and not from the presence of voice. Therefore, the bias factor $\mu(m)$ is derived by
first calculating the variability factor $\psi(m)$ as:

$$\psi(m) = \begin{cases} 0.99\, \psi(m-1) + 0.01\, SNR^2, & SNR < 0 \\ \psi(m-1), & \text{otherwise} \end{cases} \qquad (4.13)$$

which is then clamped in magnitude to $0 \le \psi(m) \le 4.0$. In addition, the variability factor is reset to zero when the
frame count is less than or equal to four ($m \le 4$) or the forced update flag (see clause 4.3.10) is set (fupdate_flag == TRUE).
The bias factor $\mu(m)$ is then calculated as:

$$\mu(m) = \max\{12.0\,(\psi(m) - 0.65),\; 0\} \qquad (4.14)$$
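Equations (4.13) and (4.14) can be illustrated with the following sketch. The function names and the `reset` argument (standing for "$m \le 4$ or fupdate_flag == TRUE") are assumptions introduced for the example.

```c
#include <assert.h>

/* Equation (4.13) with clamping to 0..4.0.
 * reset is nonzero when m <= 4 or the forced update flag is set. */
double update_variability(double psi_prev, double snr, int reset)
{
    assert(psi_prev >= 0.0 && psi_prev <= 4.0);
    if (reset)
        return 0.0;

    double psi = psi_prev;
    if (snr < 0.0)
        psi = 0.99 * psi_prev + 0.01 * snr * snr;
    if (psi > 4.0)
        psi = 4.0;
    if (psi < 0.0)
        psi = 0.0;
    return psi;
}

/* Equation (4.14): bias added to the voice metric threshold. */
double bias_factor(double psi)
{
    double mu = 12.0 * (psi - 0.65);
    return (mu > 0.0) ? mu : 0.0;
}
```

The bias stays at zero until the variability factor exceeds 0.65, so occasional small negative SNR values do not raise the threshold.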

4.3.7 VAD Decision 

The quantized SNR, $SNR_q$, is used to determine the respective voice metric threshold $v_{th}$, hangover count $h_{cnt}$, and burst
count threshold $b_{th}$ parameters:

$$v_{th} = v_{table}[SNR_q], \quad h_{cnt} = h_{table}[SNR_q], \quad b_{th} = b_{table}[SNR_q], \qquad (4.15)$$

where $SNR_q$ is the index of the respective table elements. The VAD decision can then be made according to the
following pseudocode:

    if ( v(m) > v_th + μ(m) ) {      /* if the voice metric > voice metric threshold */
        VAD(m) = ON
        b(m) = b(m-1) + 1            /* increment burst counter */
        if ( b(m) > b_th ) {         /* compare counter with threshold */
            h(m) = h_cnt             /* set hangover */
        }
    } else {
        b(m) = 0                     /* clear burst counter */
        h(m) = h(m-1) - 1            /* decrement hangover */
        if ( h(m) <= 0 ) {           /* check for expired hangover */
            VAD(m) = OFF
            h(m) = 0
        } else {
            VAD(m) = ON              /* hangover not yet expired */
        }
    }

Note that two 10 ms subframes are required to determine one VAD decision. The final decision is determined by the
maximum of two subframe decisions, i.e.

    if ( VAD(m) == ON OR VAD(m-1) == ON ) {
        VAD_flag = TRUE
    } else {
        VAD_flag = FALSE
    }
4.3.8 Spectral Deviation Estimator

The spectral deviation estimator is used as a safeguard against erroneous updates of the background noise estimate. If
the spectral deviation of the input signal is too high, then the background noise estimate update may not be permitted.
Calculate the estimated log power spectrum as:

$$E_{dB}(m,i) = 10 \log_{10}(E_{ch}(m,i)), \quad 0 \le i < N_c \qquad (4.16)$$

Then, calculate the estimated spectral deviation between the current power spectrum and the average long-term power
spectral estimate:

$$\Delta_E(m) = \sum_{i=0}^{N_c - 1} \left| E_{dB}(m,i) - \bar{E}_{dB}(m,i) \right| \qquad (4.17)$$

where $\bar{E}_{dB}(m)$ is the average long-term power spectral estimate calculated during the previous subframe, as defined in
Equation 4.20. The initial value of $\bar{E}_{dB}(m)$, however, is defined to be the estimated log power spectrum of subframe
1, or:

$$\bar{E}_{dB}(m) = E_{dB}(m), \quad m = 1 \qquad (4.18)$$

The exponential windowing factor, $\alpha(m)$, is then calculated as a function of the instantaneous frame SNR, $SNR$, and the
long-term peak SNR, $SNR_p(m)$, as:

$$\alpha(m) = \alpha_H - (\alpha_H - \alpha_L)\, \frac{SNR_p(m) - SNR}{SNR_p(m)} \qquad (4.19)$$

which is then limited to $\alpha_L \le \alpha(m) \le \alpha_H$.

The average long-term power spectral estimate is then updated for the next frame by:

$$\bar{E}_{dB}(m+1,i) = \alpha(m)\, \bar{E}_{dB}(m,i) + (1 - \alpha(m))\, E_{dB}(m,i), \quad 0 \le i < N_c \qquad (4.20)$$

where all the variables are previously defined.
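The spectral deviation, the windowing factor and the long-term spectral update can be sketched as follows. The function names are hypothetical, the limits $\alpha_L$ and $\alpha_H$ are passed in by the caller, and the windowing factor is written as $\alpha_H - (\alpha_H - \alpha_L)(SNR_p(m) - SNR)/SNR_p(m)$ clamped to $[\alpha_L, \alpha_H]$, which is an interpretation of the text and should be checked against the reference code [1]. A strictly positive $SNR_p(m)$ is assumed.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Sum of absolute differences between the current and long-term
 * log power spectra (both in dB). */
double spectral_deviation(const double *e_db, const double *e_db_avg,
                          size_t nc)
{
    double dev = 0.0;
    for (size_t i = 0; i < nc; i++)
        dev += fabs(e_db[i] - e_db_avg[i]);
    return dev;
}

/* Exponential windowing factor, limited to a_lo..a_hi. */
double windowing_factor(double snr, double snr_p, double a_lo, double a_hi)
{
    assert(snr_p > 0.0 && a_lo <= a_hi);
    double alpha = a_hi - (a_hi - a_lo) * (snr_p - snr) / snr_p;
    if (alpha < a_lo)
        alpha = a_lo;
    if (alpha > a_hi)
        alpha = a_hi;
    return alpha;
}

/* First-order update of the long-term log power spectrum, in place. */
void update_long_term_spectrum(double *e_db_avg, const double *e_db,
                               size_t nc, double alpha)
{
    for (size_t i = 0; i < nc; i++)
        e_db_avg[i] = alpha * e_db_avg[i] + (1.0 - alpha) * e_db[i];
}
```

In this form the long-term estimate adapts most slowly (alpha near its upper limit) when the frame SNR is close to the long-term peak SNR.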

4.3.9 Sinewave Detection 

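Next, the sinewave_flag is set TRUE when the spectral peak-to-average ratio $\phi(m)$ is greater than 10, i.e.

$$sinewave\_flag = \begin{cases} TRUE, & \phi(m) > 10 \\ FALSE, & \text{otherwise} \end{cases} \qquad (4.21)$$

where:

$$\phi(m) = 10 \log_{10}\left( \frac{\max\{E_{ch}(m,i)\}}{\sum_{j=0}^{N_c - 1} E_{ch}(m,j) / N_c} \right), \quad 2 \le i < N_c \qquad (4.22)$$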
4.3.10 Background Noise Update Decision

The following logic, as shown in pseudo-code, demonstrates how the noise estimate update decision is ultimately made:

    /* Normal update logic */
    update_flag = fupdate_flag = FALSE
    if ( v(m) < UPDATE_THLD and b(m) == 0 ) {
        update_flag = TRUE
        update_cnt = 0
    }
    /* Forced update logic (for over-riding the normal update logic) */
    else if (( E_tot > NOISE_FLOOR ) and ( Δ_E(m) < DEV_THLD )
             and ( sinewave_flag == FALSE ) and ( LTP_flag == FALSE )) {
        update_cnt = update_cnt + 1
        if ( update_cnt > UPDATE_CNT_THLD )
            update_flag = fupdate_flag = TRUE
    }

    /* "Hysteresis" logic to prevent long-term creeping of update_cnt */
    if ( update_cnt == last_update_cnt )
        hyster_cnt = hyster_cnt + 1
    else
        hyster_cnt = 0
    last_update_cnt = update_cnt
    if ( hyster_cnt > HYSTER_CNT_THLD )
        update_cnt = 0

where $E_{tot}$ is the total channel energy defined as:

$$E_{tot} = \sum_{i=0}^{N_c - 1} E_{ch}(m,i) \qquad (4.23)$$

and LTP_flag is generated by the comparison of the long-term prediction gain to a constant threshold LTP_THLD, i.e.:

$$LTP\_flag = \begin{cases} TRUE, & \beta > LTP\_THLD \\ FALSE, & \text{otherwise} \end{cases} \qquad (4.24)$$

where the long-term prediction gain $\beta$ is derived from the speech encoder [2] open-loop pitch predictor, and can be
expressed as:

^= ^"-" (4.25) 

where $s_w(n)$ is the weighted speech, $k$ is the optimal open-loop lag, and $N_p$ is the pitch analysis frame length. This
expression is calculated in the speech encoder on the previous frame.

4.3.11 Background Noise Estimate Update

If (and only if) the update flag is set (update_flag == TRUE), then update the channel noise estimate for the next
subframe by:

$$E_n(m+1,i) = \max\{E_{min},\; \alpha_n E_n(m,i) + (1 - \alpha_n) E_{ch}(m,i)\}, \quad 0 \le i < N_c, \qquad (4.26)$$

where $E_{min}$ is the minimum allowable channel energy, and $\alpha_n$ is the channel noise smoothing factor. The channel noise
estimate shall be initialized for each of the first four frames to the estimated channel energy, i.e.:

$$E_n(m,i) = \max\{E_{init},\; E_{ch}(m,i)\}, \quad m \le 4, \quad 0 \le i < N_c, \qquad (4.27)$$

where $E_{init}$ is the minimum allowable channel noise initialization energy.
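The noise estimate update of equations (4.26) and (4.27) can be sketched as below. The function name and argument layout are hypothetical. The sketch assumes that the initialization of equation (4.27) takes precedence over the smoothed update during the first four frames; the text does not state how the two interact.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Update the channel noise estimate in place.
 * e_n[i] holds E_n(m,i) on entry; on exit it holds the value to be
 * used for the following subframe. */
void update_noise_estimate(double *e_n, const double *e_ch, size_t nc,
                           long m, bool update_flag, double alpha_n,
                           double e_min, double e_init)
{
    assert(alpha_n >= 0.0 && alpha_n <= 1.0);
    for (size_t i = 0; i < nc; i++) {
        if (m <= 4) {
            /* Equation (4.27): initialize from the channel energy. */
            e_n[i] = (e_ch[i] > e_init) ? e_ch[i] : e_init;
        } else if (update_flag) {
            /* Equation (4.26): first-order smoothing with a floor. */
            double e = alpha_n * e_n[i] + (1.0 - alpha_n) * e_ch[i];
            e_n[i] = (e > e_min) ? e : e_min;
        }
    }
}
```

When update_flag is FALSE and m > 4, the estimate is left unchanged, which is what protects it from being contaminated by speech.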

5 Computational details 

A low level description has been prepared in the form of ANSI C source code [1].